Understanding Data: VIMO Analysis#

In this section, I will demonstrate key differences between Python and Google Sheets when it comes to cleaning a dataset and conducting a VIMO analysis. This analysis aims to identify Valid, Invalid, Missing, and Outlier variables CITE STATISTICS CANANADA VIDEO.

For the purpose of this assignment, I will not do a complete VIMO analysis. Instead, let’s find some summary statistics about our data.

Let’s start by importing pandas and our dataset using the code we saw in CROSS LIST GETTING DATA SECTION. We will also print the first five rows.

import pandas as pd

permits = pd.read_csv('https://raw.githubusercontent.com/jsmarier/course-datasets/main/ottawa-building-permits-2021.csv')

permits.head()
ST # ROAD PC WARD PLAN LOT CONTRACTOR BLG TYPE MUNICIPALITY DESCRIPTION D.U. VALUE FT2 PERMIT# APPL. TYPE ISSUED DATE
0 372 BRETTONWOOD RIDGE K2T0H8 Ward 4 4M1502 38 POOLARAMA Single Kanata Install an enclosure for a semi inground pool ... 0 0 0 1807192 Pool Enclosure 2021-Oct-06
1 11 BEGGS CRT K0A2E0 Ward 21 4M1579 1 CONTRACTOR UNKNOWN Single Rideau Intall an enclosure for a semi-inground pool (... 0 0 0 2008197 Pool Enclosure 2021-Jul-06
2 5500 CEDAR DR K4M1B4 Ward 20 NaN 18 BOISVERT, JOHN Single Osgoode Install an enclosure for a hot tub (rear yard) 0 0 0 2010072 Pool Enclosure 2021-Jul-26
3 1458 CYRVILLE RD K1B3L9 Ward 11 NaN NaN TIKKUN CONSTRUCTION Retail Gloucester Tenant fit-up to a 1 storey mixed use building... 0 75,000 3767 2100001 Construction 2021-Jan-04
4 360 LAURIER AVE W K1P1C8 Ward 14 2996 28 & 29 CONTRACTOR UNKNOWN Office Old Ottawa Interior alterations on the 2nd floor of an 11... 0 44,486 753 2100002 Construction 2021-Jan-04

Using Python’s .info() and .describe()#

As we saw in the previous section, we can use the permits.info() to show the data type of each column, as well as the number of non-null variables:

permits.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14076 entries, 0 to 14075
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ST #           14076 non-null  object
 1   ROAD           14076 non-null  object
 2   PC             12754 non-null  object
 3   WARD           14070 non-null  object
 4   PLAN           10928 non-null  object
 5   LOT            7524 non-null   object
 6   CONTRACTOR     14067 non-null  object
 7   BLG TYPE       14074 non-null  object
 8   MUNICIPALITY   14076 non-null  object
 9   DESCRIPTION    14076 non-null  object
 10  D.U.           14076 non-null  int64 
 11  VALUE          14076 non-null  object
 12  FT2            14076 non-null  int64 
 13  PERMIT#        14076 non-null  int64 
 14  APPL. TYPE     14076 non-null  object
 15  ISSUED DATE    14076 non-null  object
dtypes: int64(3), object(13)
memory usage: 1.7+ MB

We can also use .describe() to find summary statistics about the numeric columns in our dataset.

permits.describe()
D.U. FT2 PERMIT#
count 14076.000000 1.407600e+04 1.407600e+04
mean 1.948778 4.706217e+03 2.105802e+06
std 8.012956 4.543035e+04 4.400099e+03
min -18.000000 -1.201940e+05 1.807192e+06
25% 0.000000 5.400000e+01 2.102822e+06
50% 0.000000 1.695000e+03 2.105912e+06
75% 2.000000 5.157250e+03 2.108803e+06
max 404.000000 5.002896e+06 2.111792e+06

Sum & Average (Mean) Functions in Both Python & Google Sheets#

Statistics such as the sum and average (mean) can also be calculated by writing the proper Python code. Let’s use the FT2 column as an example. It shows the square footage of the different building and demolition projects listed in the dataset.

Here are the functions expressed in \(\LaTeX\) and Python. PUT REFERENCE TO MODULE 7 AND THE RESOURCES FROM ASSIGNMENT 7.

Sum#

\(\LaTeX\)

(1)#\[\Sigma = x_1+x_2 + \dots + x_n\]

Python

permits['FT2'].sum()

Average#

\(\LaTeX\)

(2)#\[\bar{x} = \frac{x_1 + x_2 + \dots + x+n}{n}\]

Python

permits['FT2'].mean()

Warning

Since I have worked with this dataset in the past, I know that there are some duplicates. These duplicates were created by city staff on purpose in order to individually list the various street addresses attached to the same permit number, for example in the case of row houses. Therefore, at this point, our sum and average functions will include these duplicates.

Let’s run the two Python functions in code cells to see the results.

permits['FT2'].sum()
66244707
permits['FT2'].mean()
4706.2167519181585

To find the sum and average of the FT2 column in Google Sheets, we can use the SUM and AVERAGE functions. They look something like this:

=SUM(M2:M14077)
=AVERAGE(M2:M14077)

We need to manually tell Google Sheets to do the required math on cells M2 through M14077, since FT2 is column N, and it features 14077 rows, including the headings.

_images/2022-12-11_ass-8_sum-average-functions-google-sheets.png

Fig. 6 Screen capture showing the SUM and AVERAGE functions used in Google Sheets.#

“Column Stats” in Google Sheets#

Alternatively, we can select the entire column we are interested in, and then click on Data > Column stats.

_images/2022-12-11_ass-8_column-stats-sum.png

Fig. 7 Screen capture of the columns stats information box.#

Whether we use .describe(), Python functions, Google Sheets functions, or Google Sheets’s column stats option, the summary statistics for column M (FT2) are the same.

Here too, I would argue that Google Sheets is perhaps easier to use. However, Python presents the advantage of quickly generating a table with summary statistics by writing a simple line of code.

Showing the Distribution With Altair#

We can also use Altair to generate plots showing the frequency distribution of the values in various columns. Let’s focus on WARD and BLG TYPE.

The code cells are hidden, but let’s remember that we first need to import the Altair library using this code:

import altair as alt

Also, since our dataset as more than 5000 rows, we need to include the following line of code, as demonstrated in the Altair documentation. We therefore see that both Google Sheets and Python have “shortcomings” when it comes to handling larger datasets.

alt.data_transformers.disable_max_rows()

** Using instructions from the modules for Python course and Viz course. ADD CITATIONS**

import altair as alt
alt.data_transformers.disable_max_rows()

alt.Chart(permits, width=600, height=450).mark_bar().encode(
    x=alt.X("WARD:N", sort="y", title="Ward"),
    y=alt.Y('count()', title="Number of Projects"))
C:\Users\marie\miniconda3\lib\site-packages\altair\utils\core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():
alt.Chart(permits, width=600, height=450).mark_area().encode(
    x=alt.X("WARD:N", sort="y", title="Ward"),
    y=alt.Y('count()', title="Number of Projects"),
    color="BLG TYPE:N")

Warning

For an unknown reason, there seems to be a parsing error somewhere, which explains why I am getting an “undefined” category for the building type instead of a list of building types. I have already spend at least one hour trying to solve this issue. Given the scope of this assignment, I decided to leave this error as is. I hope this will nonetheless count as my second chart.